PREFACE


This  report  was  prepared  as  part  of  Rand’s  research  project  on 
Computer  Security,  sponsored  by  the  National  Science  Foundation  under 
Grant  No.  MCS76-00720. 

The  growing  use  of  computers  to  store  sensitive,  private,  and 
classified  information  makes  it  increasingly  important  to  be  able  to 
determine  with  a  very  high  degree  of  confidence  the  identity  of  an 
individual  seeking  access  to  the  computer.  This  report  summarizes 
preliminary efforts to establish whether an individual can be
identified by the statistical characteristics of his or her typing.

The  investigation  was  carried  out  under  the  joint  direction  of 
Stockton  Gaines  and  Norman  Shapiro,  who  are  responsible  for  the  central 
idea of using keystroke timing as the basis for an authentication
system. They also developed the textual material upon which the experiment
was  based,  and  they  conducted  the  experiment.  James  Press  developed  the 
statistical  model  for  authentication,  directed  the  analysis  of  the 
experimental  data,  and  drafted  the  report.  William  Lisowski  programmed 
the  authentication  procedure  for  the  computer,  developed  programs  for 
analyzing  the  data,  and  ran  the  data  through  the  routines. 
SUMMARY


Can  people  be  identified  by  the  way  they  type?  To  investigate 
this  question,  an  experiment  was  carried  out  at  Rand,  in  which  seven 
professional  typists  were  each  given  a  paragraph  of  prose  to  type,  and 
the  times  between  successive  keystrokes  were  recorded.  This  procedure 
was repeated four months later with the same typists and the same
paragraph of prose. By examining the probability distributions of the
times each typist required to type certain pairs of successively typed
letters (digraphs), we found that of the large number of digraphs
represented in most ordinary paragraphs, there were five which, considered
together, could serve as a basis for distinguishing among the subjects.
The  implications  of  this  finding  are  that  touch  typists  appear  to  have 
a  typing  "signature,"  and  that  this  method  of  distinguishing  subjects 
might  provide  the  basis  for  a  computer  authentication  system. 
I. INTRODUCTION


This  report  describes  the  preliminary  results  of  an  investigation 
of the feasibility of using keystroke timing as the basis for
authenticating individuals seeking access to sensitive information stored in
a computer. In many such applications, authentication might be carried
out using software stored in the computer itself. The fundamental
question that must be answered is: Do people type in timing patterns
that are so individual that one typist can be distinguished from
another, with extremely high reliability, on the basis of their typing
"signatures"?

There is some a priori reason to believe that individuals type
differently in a statistically significant way. For instance, it has been
known that people who use a telegraph key develop a distinctive "fist"
or telegraphic style that can be recognized. Amateur radio operators
can often tell which of their friends is transmitting, before direct
identification is received. Moreover, it has been discovered that not
only is the form of an individual's written signature unique and
distinctive, so are other aspects of writing a signature. The pen
pressure used in producing the signature and the acceleration of the pen
are variables that can be measured and whose patterns can be associated
very accurately with the signer. Because the act of typing is mainly
one of involuntary control of finger movements, at least in the case
of a skilled typist, we had reason to hope at the beginning of this
investigation that typing patterns would be both different enough
between individuals and consistent enough over time that authentication
based on the timing characteristics of typing would be feasible.

To investigate the extent to which typing signatures exist, and
to evaluate whether or not individuals can actually be authenticated
on the basis of them, we designed an experiment involving a typing
"test," which we administered to subjects. After analyzing the
statistical properties of the subjects' typing patterns, we developed a
statistical model for authenticating subjects; we then applied the model
to the data from the experiment. The results were sufficiently
promising to suggest both that more extensive experimentation should be
undertaken and that the development of the statistical model should be
broadened to extend its applicability.

We also examined the earlier efforts to analyze individual
typing behavior reported by Coover (1923), Dvorak et al. (1936),
Harding (1933), Lahy (1924), Neal (1977), Ostry (1977), and Rochester
et al. (1967).

The  experiment  is  described  in  Sec.  II.  Briefly,  it  involved  the 
collection  of  samples  of  keystroke  timing  from  seven  individuals  at 
two  different  times,  separated  by  four  months.  However,  only  six  were 
available  for  the  second  data  collection.  The  statistical  model  used 
to analyze the data is described in Sec. III, and the detailed analysis
of  those  data  is  presented  in  Sec.  IV.  The  mathematical  details  of  the 
model  are  given  in  the  Appendix. 

Prior to performing our detailed analysis, we conducted the
following informal experiment: One member of the project staff was given all
the  data,  with  the  names  of  the  individuals  removed.  The  data  consisted 
of  each  individual's  average  time  for  typing  each  digraph,  i.e.,  each 
pair  of  letters  typed  successively  in  a  text. 

This person then tried to match the data from the first period with
those from the second period on an individual-by-individual basis. He
was able to do this with 100 percent success; he was even able to
identify the set of data from the individual who took the test the first
time but was not present for the second session. The comparison was
simply performed by eye, without using any sort of formal analysis
routines. This result considerably strengthened our hypothesis that
typing characteristics differ substantially between individuals.

There  are,  of  course,  many  ways  in  which  a  "signature"  might  occur 
in  an  individual's  typing  patterns.  We  might  have  looked,  for  example, 
at  the  time  to  type  entire  words,  entire  sentences,  or  entire  paragraphs. 
However,  we  chose  to  examine  digraphs,  because  they  seemed  the  most 
elemental  typing  units.  Future  analyses  might  explore  the  potential  of 
using other data for authentication. The success we achieved with
digraphs strengthened our belief that they are useful for authentication,
but we have by no means ruled out the possibility that other measures
might be even more useful.


II.  THE  EXPERIMENT 


Our experiment on keystroke timing involved having six touch typists
(professional secretaries at Rand) type each of three specially
prepared texts.* They were then asked to repeat this task four months
later, using precisely the same texts. We were thus able to study
variations across people who took the same test at the same time, and we
could also study typing consistency for a given individual typing the
same text at a later time. Two of the six typists studied were
left-handed and four were right-handed.

The three texts are reproduced in Figs. 1 through 3. The first
(Text  1)  was  designed  to  read  as  ordinary  English  text;  the  second 
(Text  2)  is  a  collection  of  "random"  English  words;  and  the  third 
(Text  3)  is  a  collection  of  "random"  phrases.  We  originally  hoped  to 
be  able  to  make  separate  conclusions  about  how  individuals  differ  in 
their  typing  of  the  three  kinds  of  textual  material.  As  it  turned 
out,  however,  there  was  insufficient  information  in  any  one  of  the 
texts  to  permit  statistical  inferences  to  be  drawn  from  that  text 
alone.  Therefore,  we  pooled  the  information  in  the  three  texts,  so 
our  data  base  was  developed  by  using  the  three  texts  as  if  they  were 
one  long  continuous  text. 

The typing keyboards were part of a PDP-11/45 computer system.
A timer was installed within the system to record the time at which
each key was struck. A small program then calculated the time between
each pair of successive letters, or digraphs. The time between
successive letters is referred to as the "digraph time." Thus, the time
it takes to type io is one digraph time, and the time to type on is
another. (Although we have so far analyzed only digraph times, we can
envision using trigraphs such as ion or tetragraphs such as tion, as
well.) The digraphs we have considered involve only lower-case letters
and spaces; upper-case letters, carriage returns, punctuation, and


* There were originally seven subjects, but one was not available
to complete the experiment.


Most Americans now do at least some of their buying on credit and most
have some form of life, health, property or liability insurance.
Institutionalized medical care is almost universally available.
Government social services programs now reach deep into the population along
with government licensing of occupations and professions, federal
taxation of individuals, and government regulation of business and labor
union affairs. Today government regulates and supports large areas of
economic and social life through some of the nation's largest
bureaucratic organizations, many of which deal directly with individuals.

In fact, many of the private sector record keeping relationships
discussed in this report are to varying degrees replicated in programs
administered or funded by federal agencies.

A significant consequence of this marked change in the variety and
concentration of institutional relationships with individuals is that
record keeping about individuals now covers almost everyone and
influences everyone's life, from the business executive applying for a
personal loan to the school teacher applying for a national credit
card, from the riveter seeking check guarantee privileges from the
local bank to the young married couple trying to finance furniture for
their first home. All will have their creditworthiness evaluated on
the basis of recorded information in the files of one or more
organizations. So also with insurance, medical care, employment, education,
and social services. Each of those relationships requires the
individual to divulge information about himself, and usually leads to
some evaluation of him based on information about him that some other
record keeper has compiled.


Fig.  1  —  Sample  1 
plasma wring fork gnome twitch vapor proms doze half blur whimper
fib  fuzzy  eggnog  docent  wry  placard  gyp  pablum  duffle  twenty 
extract  wheeze  ward  churn  endurable  bystander  legible  avid  razz 
vivisect swat hull smirk prams type active keys lyse skirmish
frenzy  fox  extra  hubby  swamp  excite  skies  keg  stanza  pun  kill 
form  sweaty  foxy  half  smuggler  lava  excise  under  duffer  fuzzy 
active churn smirk half form excise twitch under docent legible
extract  wheeze  ward  pablum  wring  doze  smuggler  keys  skirmish 
bystander  gnome  endurable  swamp  plasma  vapor  avid  half  frenzy 
stanza  placard  prams  vivisect  keg  fork  gyp  sweaty  pun  skies  blur 
eggnog  razz  type  swat  lyse  hubby  excite  kill  duffle  foxy  lava  wry 
fib  proms  hull  fox  extra  twenty  whimper  duffer  pun  form  ward 
churn fork eggnog plasma skirmish endurable razz active foxy swat
excite  vivisect  twenty  placard  fuzzy  wheeze  fox  smuggler  avid 
hull  fib  type  docent  bystander  prams  blur  pablum  doze  lyse 
extract  duffer  keys  vapor  duffle  under  skies  wry  whimper  swamp 
kill  smirk  twitch  keg  frenzy  sweaty  hubby  excise  stanza  gyp  half 
proms  lava  gnome  wring  half  legible  extra  keys  frenzy  extract 
swamp  kill  smuggler  wring  gyp  plasma  bystander  vivisect  half 
active  under  wheeze  stanza  skies  hubby  placard  type  fuzzy 
endurable  legible  duffer  twenty  doze  skirmish  pablum  docent  foxy 
vapor  ward  blur  eggnog  pun  proms  fox  excite  lyse  half  twitch 
duffle  lava  sweaty  form  avid  prams  smirk  fork  whimper  keg  gnome 
hull  extra  churn  excise  wry  swat  fib  razz  eggnog  duffer  half 
excite  pun  type  placard  bystander  smuggler  hull  endurable  frenzy 
half  keys  skies  legible  hubby  fork  fib  blur  twitch  swat  skirmish 
swamp  wheeze  gnome  active  gyp  razz  lyse  extract  duffle  ward  smirk 
whimper  excise  prams  avid  proms  wry  fuzzy  stanza  vapor  under  doze 
form  pablum  twenty  docent  lava  plasma  vivisect  wring  sweaty  foxy 
churn  extra  kill  fox  keg 


Fig.  2  —  Sample  2 
This typing exercise is a strange jumble of awkward phrases,
representing  the  quintessence  of  exquisite  digraphs  dictated  by  a 
foreign  midget.  It  is  a  plethora  of  puzzling  words  under  the  guise 
of  psychological  authentication  policy,  although  you  may  perceive 
it  quizzically  as  an  ambiguous  wasteful  plot  to  overcome  summertime 
melancholy.  Your  vituperations  against  this  phenomenon  will  add  to 
the  dense  psychodrama  in  which  this  impossible  business  is  entwined. 
The  hyphenated  rhythms  of  this  ridiculous  nightmare  may  elicit 
smothered  teardrops  as  well  as  excited  little  laughs.  The  psychotic 
excesses  may  lead  to  indefinite  suspense  or  mumbling  traditional 
sayings,  or  may  just  produce  a  kind  of  loud  ringing  in  the  ether. 
Whatever  the  consequence,  enough  mystic  bifurcations  dangled  and 
untried  will  decimate  the  ranks  of  all  but  the  most  adventurous  or 
mercenary.  If  it  is  rough,  pound  it;  if  lousy,  fight  it.  All  is 
fair  in  cybernetic  war  if  plotted  smartly. 

The  English  jury  snapped  under  the  known  betrayal,  but  still 
sent  the  European  ragamuffin  to  the  penitentiary.  The  earthenware 
was  made  from  black  milk,  pounded  to  a  chalky  consistency.  The 
sedentary  safecracker  succeeded  by  using  a  lubricated  blue  pencil. 
He  would  swear  that  a  snafu  was  unsynchronized,  although  fencing 
at  a  high  altitude  was  crass.  The  excluded  sex  rarely  chuckles 
unless  judiciously  engaged  in  schoolwork.  The  phenomenal  pansies 
growing  aside  the  softball  mound  were  fortuitous  twins.  Stubble  in 
bulk  should  be  checked.  The  suspected  lubber  did  not  wear  sable 
onto  the  frigate.  A  blank  lethargy  results  from  a  lackadaisical 
twiddling.  If  you  are  dumbfounded,  you  may  quit  and  go  dancing 
or  bicycle  on  the  promenade. 


Fig.  3  — Sample  3 
special characters have been ignored because of their relative
infrequency in typewritten material. The digraph times ranged from a
minimum of about 75 milliseconds to a maximum of several seconds (times
were recorded to an accuracy of within 1 millisecond). The extremely
high values probably represented some external interruption of the
typing task. The typical digraph time was around 125 milliseconds.
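
As an illustration only, the digraph-time calculation described above might be sketched as follows. The function name `digraph_times`, the event format (character, timestamp in milliseconds), and the sample events are assumptions made for this sketch; the report does not reproduce the original PDP-11/45 program.

```python
from typing import List, Tuple

# Only lower-case letters and spaces are considered, as in the report.
ALLOWED = set("abcdefghijklmnopqrstuvwxyz ")


def digraph_times(events: List[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Return (digraph, elapsed ms) for each pair of successive keystrokes.

    `events` is a list of (character, time in ms) in typing order.
    Pairs that involve any character outside ALLOWED are skipped.
    """
    result = []
    for (c1, t1), (c2, t2) in zip(events, events[1:]):
        if c1 in ALLOWED and c2 in ALLOWED:
            result.append((c1 + c2, t2 - t1))
    return result


# Hypothetical keystroke events: "ion" followed by a period and a space.
events = [("i", 0), ("o", 120), ("n", 250), (".", 400), (" ", 520)]
times = digraph_times(events)
print(times)
```

In this sketch the pairs involving the period are dropped, so only the digraphs io and on contribute digraph times.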

Once  we  started  analyzing  the  digraph  times,  it  became  clear  that 
in  future  experiments  we  could  avoid  certain  problems  by  building  into 
our  experimental  texts  a  certain  minimum  number  of  replications  of 
"important"  digraphs;  moreover,  we  would  try  to  make  the  texts  used 
in  multiple-text  experiments  more  unlike  one  another  than  those  used 
in  the  initial  experiment.  Finally,  we  would  use  a  larger  number  of 
subjects  in  subsequent  experiments. 




III.  THE  STATISTICAL  MODEL 


INTRODUCTION 

The statistical model we have adopted assumes that a person (called
the "originator") who will later desire to gain access to a computer
types some predesignated text into the computer, which then retains
information regarding the keystroke-typing time. Later, another person
(called the "claimant") who wishes access to the computer and who makes
a claim to being the originator is asked by the computer to type in
another predesignated text. The computer must now compare the
keystroke-typing time patterns of the claimant with those of the originator. If
the two are the same, at least in terms of their statistical
characteristics, then a system based upon our model will authenticate the claimant
as being the same person as the originator; if the patterns do not match,
the system will not authenticate the claimant and will not allow him to
log on.

An authentication system can make two types of error: a "primary"
error,  in  which  an  unauthorized  person  (impostor)  is  granted  access  to 
the  computer;  and  a  "secondary"  error,  in  which  the  system  fails  to 
give  access  to  an  authorized  person.  While  the  terms  "primary"  and 
"secondary"  are  of  course  arbitrary,  a  primary  error  would,  in  most 
contexts,  be  much  worse  than  a  secondary  error.  (An  exception  would 
be  the  case  in  which  a  decisionmaker,  such  as  an  army  general,  must 
issue  counterattack  commands  immediately,  in  response  to  an  attack, 
and  he  must  do  it  through  a  computer.  If  the  computer  security  system 
fails  to  authenticate  him  and  denies  him  access,  precious  minutes  are 
lost  while  the  general  tries  to  get  his  counterattack  started.) 

The  hypothetical  authentication  system  considered  here  is  based 
upon  a  statistical  model  that  uses  the  classical  theory  of  hypothesis 
testing.  The  basic  ideas  behind  classical  hypothesis  testing  have 
been  amply  described  elsewhere,  so  they  will  not  be  repeated  here.  We 
will  build  and  draw  upon  them,  however. 

In the authentication problem, we will use H to denote the
hypothesis that the claimant and the originator are the same person, and A
will denote the hypothesis that the claimant and the originator are
different persons, i.e., that the claimant is an impostor.

We test H versus A in terms of the significance level of the test,
which is normally written

$$\alpha = P\{\text{rej. } H \mid H\} = P\{\text{making a secondary error}\}.$$

The probability of making the other kind of error is given by

$$\beta = P\{\text{rej. } A \mid A\} = P\{\text{making a primary error}\}.$$

In many problems of inference, $\alpha$ is taken to be .01, .05, or .10. We
will also work in this range.

Ideally, we should attempt to simultaneously minimize $\alpha$ and
$\beta$, but unfortunately, we cannot reduce one without increasing the other.
In keeping with normal statistical practice, therefore, we will fix $\alpha$
in advance at some tolerably low level and try to keep $\beta$ as small as
possible.

In our problem, we will use a test statistic U that reflects the
difference in keystroke patterns between the originator and the
claimant. If the two individuals are the same person, U should be
small (i.e., not significant, reflecting only random sampling
variation), and we should not want to reject H. Therefore, the p-value
corresponding to an observed U should be large (≥ .05). If in fact the
p-value is small, we generate a secondary error.

Alternatively, suppose the originator and the claimant are
different persons. In this case, U should be large (significant), and
we should want to reject H. Therefore, the p-value should be small
(< .05). If in fact the p-value is large, we generate a primary error.
These concepts are summarized in Table 1.

We  derived  the  test  procedure  for  our  problem  on  the  basis  of  a 
classical  likelihood  ratio  test.  The  procedure  is  summarized  below; 
the  technical  details  of  the  derivation  are  given  in  the  Appendix. 
Table 1

ERROR CONCEPTS

                                  p-value Large     p-value Small
                                  (accept H)        (reject H)

H is true: originator and
claimant are the same person      No error          Secondary error

H is false: originator and
claimant are different persons    Primary error     No error

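
The error concepts of Table 1 can be expressed as a small decision rule. The following is a minimal sketch, assuming a significance level of .05 (one of the values mentioned in the text); the function names `accept_h` and `error_type` are hypothetical and are not taken from the report.

```python
def accept_h(p_value: float, alpha: float = 0.05) -> bool:
    """Accept H (authenticate the claimant) when the p-value is large."""
    return p_value >= alpha


def error_type(same_person: bool, p_value: float, alpha: float = 0.05) -> str:
    """Classify the outcome of one authentication attempt as in Table 1."""
    accepted = accept_h(p_value, alpha)
    if same_person and not accepted:
        return "secondary error"  # authorized person denied access
    if not same_person and accepted:
        return "primary error"    # impostor granted access
    return "no error"


print(error_type(same_person=True, p_value=0.40))   # large p-value, H true
print(error_type(same_person=False, p_value=0.40))  # large p-value, H false
```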

AUTHENTICATION  EQUATIONS  AND  PROCEDURE 

The subject whose keystroke typing patterns are being evaluated
(either claimant or originator) is asked to type a paragraph of prose,
and the computer records the time between all successive keystrokes.
For a judiciously selected group of digraphs, the authentication
procedure will compare the digraph times from the claimant's sample with
those from the originator's sample.

For example, the originator types the digraph th ten times in some
nonrepetitive, prose context (to avoid "learning"), with a mean digraph
time of 85 milliseconds and a standard deviation of 5 milliseconds.
The claimant then types the th digraph 15 times, with a mean digraph
time of 150 milliseconds and a standard deviation of 10 milliseconds.
In this case, it seems likely that the claimant is an impostor.
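
To make the th example concrete, a pooled two-sample Student t-statistic can be computed directly from the summary figures quoted above. This is a sketch for illustration only: it applies the t-statistic to the raw millisecond summaries, whereas the model itself works with transformed (logged) times, and the function name `pooled_t_from_summary` is hypothetical.

```python
import math


def pooled_t_from_summary(mean1: float, sd1: float, n1: int,
                          mean2: float, sd2: float, n2: int):
    """Pooled-variance two-sample t-statistic from means, SDs, and counts."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    std_err = math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return (mean1 - mean2) / std_err, df


# Originator: 10 replications, mean 85 ms, SD 5 ms.
# Claimant:   15 replications, mean 150 ms, SD 10 ms.
t, df = pooled_t_from_summary(85.0, 5.0, 10, 150.0, 10.0, 15)
print(f"t = {t:.2f} on {df} degrees of freedom")
```

A t-statistic of this magnitude (roughly 19 in absolute value on 23 degrees of freedom) is far beyond any conventional critical value, which agrees with the informal judgment that the claimant is likely an impostor.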

The raw data collected in any real situation are likely to show
that digraph times for a specific digraph are roughly log-normally
distributed (see Sec. IV). Thus, their logarithms are approximately
normally distributed. We assume in the authentication equations that
the variables created by transformation from the raw data are
approximately normally distributed.

We work simultaneously with r distinct digraphs, each of which is
assumed to be typed M times by the originator and N times by the
claimant. In fact, because of typing errors, subjects tended to type
different numbers of replications of a given digraph. For example,
if one typist inadvertently omitted a word that included a th, while
all of the others made no errors involving a th, that typist would
have one fewer replication for th than the others. For purposes of
analyzing the text obtained from an originator, we selected the first
M replications of a given digraph in the text (for the claimant, we
selected the first N), and we ignored the remainder. M and N were
determined as the smallest number of replications that occurred for all
r digraphs. Thus, if there were three digraphs to be considered for
the originator and one was replicated 12 times, another 15 times, and
the third 15 times, we would select M=12, because there were at least
12 replications in all three (and the statistical model requires an
equal number for all digraphs). We would then select for analysis the
first 12 occurrences of each of the three types of digraphs in the
originator's text.
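
The selection of the first M replications can be sketched as follows. The function name `equalize_replications` and the placeholder timing values are assumptions for illustration; only the rule itself (take the smallest replication count across the r digraphs and keep the first that many occurrences of each) comes from the text.

```python
from typing import Dict, List


def equalize_replications(times: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Keep the first M times of each digraph, where M is the smallest count."""
    m = min(len(values) for values in times.values())
    return {digraph: values[:m] for digraph, values in times.items()}


# Three digraphs replicated 12, 15, and 15 times (placeholder values in ms).
originator = {
    "th": [100.0 + i for i in range(12)],
    "he": [110.0 + i for i in range(15)],
    "in": [120.0 + i for i in range(15)],
}
selected = equalize_replications(originator)
print({digraph: len(values) for digraph, values in selected.items()})
```

The same function would be applied separately to the claimant's sample to obtain N.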

We assume that the M+N digraph times for each of the r digraphs
(that is, (M+N)r distinct times) are mutually independent. We know,
of course, that this assumption is not strictly true, but we adopt it
for simplicity as a first approximation to see if a system can
eventually be developed around it. Clearly, the third time a th is typed
in no way influences (or is influenced by) the fourth time a th is
typed by the same person; nor is there generally any natural way to
pair the digraph times for any particular pair of digraphs (identical
or not).

We assume that the distribution of the time required to type a
particular digraph has, after transformation to normality, the same
variance for both originator and claimant. That is, the variance of
the transformed digraph time distribution for a th will be taken to
be the same for both originator and claimant, although the variances for
different digraphs, such as th and he, are permitted to differ. The mean
digraph times are of course permitted to differ from one another, both
across digraphs and between claimant and originator; in fact, the test of
hypotheses H versus A will be carried out on the basis of how the mean
digraph times of claimant and originator compare. The assumption of
equal variances for claimant and originator made above is justifiable
on the basis of the well-known metatheorem in statistical theory:
Tests for equality of means, under normality, are fairly insensitive
to violations of the assumption of equal variances. This is a
robustness property of Student t-tests. Thus, the statistical model for
authentication basically involves testing the hypotheses that the mean
vectors (vectors of mean digraph times) for two multivariate normal
populations are or are not the same, assuming that the two populations
have the same diagonal covariance matrix (diagonal, because the digraph
times are assumed to be independent). A likelihood ratio test is
carried out to develop an appropriate test statistic, and it is found,
not surprisingly, that the test statistic is a function of the
corresponding Student t-statistics for each of the digraphs. In fact,
the test consists of adding 1 to the Student t-statistic for each
digraph, then multiplying all of them together. A monotone function of
this product is tested for significance.
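
A statistic of the kind described above might be sketched as follows. This is not the derivation from the Appendix, which is not reproduced here. The sketch assumes the standard likelihood-ratio form for equal-variance normal samples, in which each digraph contributes a factor 1 + t^2/(M+N-2), and it takes the logarithm of the product as the monotone function. The function names and the sample log digraph times are hypothetical, and no p-value is computed.

```python
import math
import statistics
from typing import Dict, Sequence


def pooled_t(x: Sequence[float], y: Sequence[float]) -> float:
    """Two-sample Student t-statistic with a pooled (equal) variance."""
    m, n = len(x), len(y)
    df = m + n - 2
    pooled_var = ((m - 1) * statistics.variance(x)
                  + (n - 1) * statistics.variance(y)) / df
    if pooled_var == 0:
        raise ValueError("pooled variance is zero; t is undefined")
    std_err = math.sqrt(pooled_var * (1.0 / m + 1.0 / n))
    return (statistics.mean(x) - statistics.mean(y)) / std_err


def product_statistic(originator: Dict[str, Sequence[float]],
                      claimant: Dict[str, Sequence[float]]) -> float:
    """Log of the product over digraphs of 1 + t^2 / (M + N - 2)."""
    total = 0.0
    for digraph, x in originator.items():
        y = claimant[digraph]
        df = len(x) + len(y) - 2
        t = pooled_t(x, y)
        total += math.log(1.0 + t * t / df)
    return total


# Hypothetical log digraph times, for illustration only.
originator = {"th": [4.40, 4.50, 4.60, 4.45], "he": [4.80, 4.90, 4.85, 4.95]}
same = {"th": [4.42, 4.52, 4.58, 4.47], "he": [4.82, 4.88, 4.86, 4.93]}
other = {"th": [5.00, 5.10, 4.95, 5.05], "he": [5.20, 5.30, 5.25, 5.35]}

u_same = product_statistic(originator, same)
u_other = product_statistic(originator, other)
print(u_same, u_other)
```

Under these assumptions the statistic is zero when the two samples have identical means for every digraph and grows as the means diverge, which matches the role of U in the text.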

EXTENSIONS OF THE MODEL

Extensions of this statistical model could conceivably involve
development of models that permit different numbers of replications
for different digraphs, unequal variances for the distributions of
digraph times for claimant and originator, correlations of times for
distinct digraphs, and perhaps a better approximation to the
distribution of a product of independent beta variates than the one developed
in the Appendix. Such extensions could increase the flexibility of
an eventual authentication system and might improve the precision of
such a system by providing statistical tests that are more powerful
and make fewer errors. We might also develop a measure of sensitivity
of the authentication tests based upon the notion of "power" of a test
of hypotheses. We are considering an alternative model in which the
parameters of the originator's digraph distributions are assumed to be
known, because the schema we envision should permit us to obtain large
numbers of replications of digraphs for the originator, although
probably not for the claimant.

Note on the independence assumption made earlier in this section:
It is clearly important to test this assumption. A fundamental
problem, however, is that there is no natural pairing of digraphs that
will permit us to compute the sample correlation of digraph times
across N pairs. Instead, we computed sample correlations across
the first occurring sets of pairs for a great many digraphs. In all
such cases, the correlations were not significant at the 5 percent
level of significance.
IV. STATISTICAL ANALYSIS


BACKGROUND 

The data collected in the experiment include typescripts of three
structured texts, typed on two different occasions by six touch typists.
The dates on which the experiment was administered, August 16 and
December 14, 1977, were four months apart. Six subjects participated
in the experiment, but not all of them typed all three texts each
time. Typist 2 failed to type Text 3 in December; Typist 3 failed to
type Texts 2 and 3 in August; and Typist 6 failed to type Text 2 in
August. The missing data are summarized in Table 2.

Table 2

MISSING EXPERIMENTAL DATA

          Typist's        August Session            December Session
Typist   Handedness   Text 1  Text 2  Text 3    Text 1  Text 2  Text 3

  1        Left         -       -       -         -       -       -
  2        Right        -       -       -         -       -       X
  3        Right        -       X       X         -       -       -
  4        Left         -       -       -         -       -       -
  5        Right        -       -       -         -       -       -
  6        Right        -       X       -         -       -       -


The times for all digraphs in each of the texts were recorded at
both sessions. The first question we addressed in the analysis of the
data was: What is the distribution of digraph times for a given
subject, for a given digraph, both in August and in December?

We began by developing computer plots of the histograms associated
with each case. A sample of the histogram plots is given in Fig. 4.
Each histogram is labeled with four codes. The first code gives, in
order, the number of the subject (1-6), the number of the text typed
(1-3), and whether the test was taken in August (1) or in December (2).
The second code is the digraph. Those entries that include a dash (such
as e- or -a) indicate that the letter shown may be paired with any other
character. The third code represents the number of replications of
the digraph plotted in the histogram, and the fourth entry gives the
scaling for the vertical scale in each plot (for example, a 3 indicates
that each X in the histogram represents three replications, except
perhaps for the topmost X in each column, which may represent one, two, or
three replications). Thus, the first histogram in Fig. 4 represents
the performance of Typist 2 on Text 1 in August; 26 replications of the
digraph al are plotted, with each X representing three replications.
The histogram on the second line labeled "212 e- 33 2" indicates that
an e was followed by some other character 33 times in the sample.

Fig. 4 - Histogram plots of digraph times

The horizontal scale shows digraph time, measured in 25-millisecond
intervals, starting with 50 milliseconds. Thus, the first column in
each histogram shows the number of times the given digraph was typed
in 50 to 74 milliseconds; the next column is for 75 to 99 milliseconds;
etc. The rightmost column indicates digraph times of 400 milliseconds
and above.
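As an illustration of the binning just described, the following Python
sketch assigns a digraph time to a histogram column and computes how
many X marks a column would need at a given scaling. The function names,
and the choice to skip times under 50 milliseconds, are assumptions made
for this example; the report does not describe the plotting program
itself.

```python
import math

BIN_START_MS = 50    # left edge of the first column
BIN_WIDTH_MS = 25    # width of each regular column
OVERFLOW_MS = 400    # the rightmost column collects times at or above this


def column_index(time_ms):
    """Return the 0-based histogram column for a digraph time in ms.

    Returns None for times below 50 ms (an assumption; the report does
    not say how such times were handled).
    """
    if time_ms < BIN_START_MS:
        return None
    if time_ms >= OVERFLOW_MS:
        return (OVERFLOW_MS - BIN_START_MS) // BIN_WIDTH_MS
    return int((time_ms - BIN_START_MS) // BIN_WIDTH_MS)


def column_counts(times_ms):
    """Count the digraph times falling in each histogram column."""
    n_columns = (OVERFLOW_MS - BIN_START_MS) // BIN_WIDTH_MS + 1
    counts = [0] * n_columns
    for t in times_ms:
        idx = column_index(t)
        if idx is not None:
            counts[idx] += 1
    return counts


def marks_needed(count, scale):
    """Number of X marks when each X stands for up to `scale` replications."""
    return math.ceil(count / scale)
```

With these settings there are 14 regular columns (50 to 399
milliseconds) plus one overflow column, and a column holding 7
replications at a scaling of 3 needs three X marks, the topmost of which
stands for a single replication.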

We hypothesized that the large tail in the distribution was caused
by the typist sneezing, pausing, or whatever, while typing some digraphs.
Accordingly, we removed all digraph times exceeding 500 milliseconds
from the data, then reexamined the histograms. We still found long
tails in the distribution, so we took the logarithm of all digraph
times* and replotted the histograms in terms of the logged data
(excluding the digraph times exceeding 500 milliseconds). These histograms
tended to look much more normally distributed than any of the previous
plots (although this was not true in all cases). The data obtained by
removing the outlying digraph times and taking logs of all remaining
observations will hereafter be referred to as the transformed data.

* The log transformation is a special case of the more general
class of so-called Box-Cox transformations, used to induce normality
of the transformed data (the more general class also includes power
transformations). We decided, however, to ignore the possibility of
achieving even better fits to normality with such transformations
because of the exploratory nature of the analysis.

Now that the transformed data at least "looked" normally distributed,
we proceeded to check further into how far the distributions
actually deviated from normality. We computed the first four sample
moments of each set of the transformed data and then evaluated the
mean, variance, skewness, and kurtosis. The skewness is defined as
the third central moment divided by the variance raised to the power
3/2. The kurtosis is computed by subtracting 3 from the fourth central
moment divided by the squared variance. Both the skewness and the
kurtosis are zero in a normal distribution.
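The transformation and the four summary quantities described above can
be sketched in Python as follows. This is a minimal illustration, not
the program used in the study: the function names are hypothetical,
natural logarithms of times in milliseconds are assumed (which is
consistent with tabulated means of roughly 4.5 to 5.4), and the central
moments are computed with divisor n, which is also an assumption.

```python
import math


def transform(times_ms, cutoff_ms=500.0):
    """Drop digraph times exceeding the cutoff and take natural logs."""
    return [math.log(t) for t in times_ms if 0 < t <= cutoff_ms]


def sample_moments(x):
    """Mean, variance, skewness, and kurtosis from central moments."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n   # second central moment
    m3 = sum((v - mean) ** 3 for v in x) / n   # third central moment
    m4 = sum((v - mean) ** 4 for v in x) / n   # fourth central moment
    return {
        "n": n,
        "mean": mean,
        "variance": m2,
        "skewness": m3 / m2 ** 1.5,        # zero for a normal distribution
        "kurtosis": m4 / m2 ** 2 - 3.0,    # zero for a normal distribution
    }
```

A typical use would be `sample_moments(transform(raw_times))` for the
raw times of one typist, one test, and one digraph.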

An illustrative collection of sample moments is shown in Fig. 5,
for the case of Typist 6 (recall that 612 denotes Typist 6 typing Text
1 in December); n denotes the number of replications of each digraph.

Case
(typist/test)   Digraph    n     Mean    Variance   Skewness   Kurtosis
612             s-         27    4.702    0.056      -1.223      0.986
612             th         29    4.930    0.020       1.109      5.379
612             -a         26    4.921    0.027       0.950      2.279
612             -t         27    4.772    0.086       3.085      9.404
622             er         25    5.004    0.023      -0.031     -0.611
622             e-         46    4.928    0.037       0.082      0.523
622             r-         32    4.664    0.092      -0.120     -0.777
622             y-         27    5.072    0.068       1.097      0.739
622             -f         27    4.966    0.057       0.429     -0.759
622             -s         36    5.024    0.053       2.477      8.730
631             d-         25    4.744    0.049      -1.422      2.257
631             en         25    4.924    0.098       1.632      1.732
631             e-         44    4.715    0.055       0.543      0.957
631             he         30    4.803    0.040       0.988      1.122
631             in         28    4.912    0.066       1.672      3.069
631             s-         28    4.791    0.200      -0.543      1.145
631             th         26    4.964    0.015       0.994      1.283
631             -a         27    4.824    0.050       1.080      1.837
631             -t         25    4.800    0.068       0.872      1.363
632             en         25    4.974    0.028       1.269      0.807
632             e-         43    4.826    0.070       0.526      2.283
632             he         29    4.944    0.070       1.009      0.259
632             in         25    4.896    0.030       0.826      0.513
632             s-         28    4.823    0.043       0.115      0.644
632             th         26    5.005    0.047       1.661      3.357

Fig. 5: Moments of transformed data

Inference regarding the population values was carried out as follows:
Let

$$\theta = \text{skewness} = \mu_3 / \mu_2^{3/2},$$

$$\phi = \text{kurtosis} = (\mu_4 / \mu_2^{2}) - 3,$$

where

$$\mu_k \equiv E(X - EX)^k, \quad k = 2, 3, 4,$$

and E denotes expected value of a random variable.

It has been shown (see Cramer, 1946) that for large sample sizes,
n, assuming X is normally distributed, it is approximately true that

$$\hat{\theta} \sim N(0, \tau_1^2), \qquad \hat{\phi} \sim N(0, \tau_2^2),$$

where

$$\tau_1^2 = \frac{6(n-2)}{(n+1)(n+3)} \approx \frac{6}{n}, \qquad
\tau_2^2 = \frac{24n(n-2)(n-3)}{(n+1)^2(n+3)(n+5)} \approx \frac{24}{n},$$

and

$$\hat{\theta} = \hat{\mu}_3 / \hat{\mu}_2^{3/2}, \qquad
\hat{\phi} = (\hat{\mu}_4 / \hat{\mu}_2^{2}) - 3,$$

$$\hat{\mu}_k = \frac{1}{n} \sum_{j=1}^{n} (X_j - \bar{X})^k, \quad k = 2, 3, 4.$$


A caret over a quantity denotes its value estimated by replacing
population quantities by sample quantities. Thus, for n = 25, for example,
since the 95 percent fractile for rejection of normality is approximately
2 standard deviations (that is, $2\tau_1$ and $2\tau_2$), we should reject
at the 5 percent level of significance if

$$|\hat{\theta}| > .88, \quad \text{or} \quad |\hat{\phi}| > 1.5.$$

Figure 5 shows that although many sample skewness values exceed .88 and
many sample kurtosis values exceed 1.5, the actual sample values are
not substantially different from the critical values of .88 and 1.5,
respectively. That is to say, while the distributions of the transformed
variables are clearly not normal, they appear to be approximately
so. The same conclusion holds for all cases, including those
not shown. Therefore, we decided to go forward on the assumption that
the transformed data were normally distributed.
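The critical values quoted for n = 25 can be checked with a short
Python sketch based on the large-sample variance expressions attributed
to Cramer above. The function names and the default multiplier of 2 are
choices made for this illustration only.

```python
import math


def tau1_squared(n):
    """Approximate variance of the sample skewness under normality."""
    return 6.0 * (n - 2) / ((n + 1) * (n + 3))


def tau2_squared(n):
    """Approximate variance of the sample kurtosis under normality."""
    return 24.0 * n * (n - 2) * (n - 3) / ((n + 1) ** 2 * (n + 3) * (n + 5))


def critical_values(n, multiplier=2.0):
    """Return (skewness cutoff, kurtosis cutoff) as multiplier * tau."""
    return (multiplier * math.sqrt(tau1_squared(n)),
            multiplier * math.sqrt(tau2_squared(n)))
```

For n = 25 this gives cutoffs of roughly 0.87 and 1.46, close to the
rounded values of .88 and 1.5 used in the text.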

Figure 6 shows how the distributions compared with one another
when all three texts were combined (i.e., digraph times were pooled).
The plots for each of the typists were developed for the digraph th.
There were at least nine replications of each case. While the mean
values of the logged digraph times tend to differ from one another,
the variances tend to be fairly constant.

[Fig. 6: Comparison of digraph-time distributions]

CONSISTENCY OF TYPING PATTERNS OVER TIME

The question of whether or not an individual's typing pattern
changes over time is a central consideration in determining the
feasibility of an authentication method based on keystroke timing.

To investigate typing consistency over time, we studied each
digraph separately. For a given typist and a given digraph, there was
a set of mutually independent replications available from the August
test, and another set from the December test. (The frequencies of
replication of a given digraph for the two tests differ occasionally
because of typing errors.)

We  took  the  transformed  data  as  the  basic  variables,  assumed  the 
variances  of  the  transformed  digraph  times  were  the  same  for  both  sets 
(we  adopted  the  concept  of  test  robustness  explained  above),  then 
carried  out  a  classical  two-sample  Student  t-test  of  the  hypothesis 
that  the  means  of  the  transformed  digraph  times  were  the  same.  The 
analysis  is  given  below. 

Suppose a given digraph $\alpha$ is replicated by a given typist $\beta$, M
times in August and N times in December (these are frequencies obtained
after removing digraph times exceeding 500 milliseconds). Let the
logged values of the digraph times in August be denoted by $X_1, \ldots, X_M$,
and the corresponding logged December times be denoted by $Y_1, \ldots, Y_N$.
The $X_i$'s and the $Y_j$'s are assumed mutually independent, and
independent of one another. Since we purposely logged the data in order to
induce normality as an approximating distribution, and because the sample
variances in August and December are approximately the same, it is
reasonable to assume that

$$X_i \sim N(\theta_1, \sigma^2), \qquad Y_j \sim N(\theta_2, \sigma^2),$$

i = 1,...,M; j = 1,...,N. The problem is to test the hypothesis
$H: \{\theta_1 = \theta_2,\ \sigma^2 > 0\}$ versus the alternative hypothesis
$A: \{\theta_1 \neq \theta_2,\ \sigma^2 > 0\}$. If H is true,
it implies that $\beta$'s typing has not changed significantly over the
four-month period, insofar as digraph $\alpha$ is concerned. The classical
(uniformly most powerful unbiased) test of H versus A is to form the
t-statistic

$$t^* = \frac{\bar{X} - \bar{Y}}
{\left\{ \dfrac{1}{H} \cdot
\dfrac{\sum_{i=1}^{M} (X_i - \bar{X})^2 + \sum_{j=1}^{N} (Y_j - \bar{Y})^2}{M+N-2}
\right\}^{1/2}},$$

where

$$\frac{1}{H} = \frac{1}{M} + \frac{1}{N}, \qquad
\bar{X} = \frac{1}{M} \sum_{i=1}^{M} X_i, \qquad
\bar{Y} = \frac{1}{N} \sum_{j=1}^{N} Y_j,$$

and test it for significance, using the fact that under H,

$$t^* \sim t_{M+N-2};$$

that is, under H, t* follows a Student t-distribution with M+N-2
degrees of freedom. The form of the test is to reject H if the
absolute value of t* exceeds a critical value determined by the
significance level of the test.
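A minimal Python sketch of the pooled two-sample t-statistic described
above is given here for illustration. The function names are
hypothetical, and the second function, which works from summary values
(two means, a pooled standard deviation, and the two replication
counts), is a convenience added for this example rather than part of the
original procedure.

```python
import math


def pooled_t(x, y):
    """Return (t*, degrees of freedom) for two samples of logged times."""
    m, n = len(x), len(y)
    xbar = sum(x) / m
    ybar = sum(y) / n
    # Sum of squared deviations about each sample's own mean.
    ss = sum((v - xbar) ** 2 for v in x) + sum((v - ybar) ** 2 for v in y)
    df = m + n - 2
    pooled_var = ss / df
    t_star = (xbar - ybar) / math.sqrt(pooled_var * (1.0 / m + 1.0 / n))
    return t_star, df


def pooled_t_from_summary(mean_x, mean_y, pooled_sd, m, n):
    """Same statistic computed from summary values."""
    return (mean_x - mean_y) / (pooled_sd * math.sqrt(1.0 / m + 1.0 / n))
```

Applied to rounded summary values of the kind reported for Typist 2
(for example, means of 4.691 and 4.568, a pooled standard deviation of
0.130, and 27 and 19 replications), the summary form gives a value near
3.16; small differences from a tabulated statistic are to be expected
when the inputs have been rounded. Converting t* to a p-value requires
the Student t-distribution function, which is not included in this
sketch.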

All digraphs for all typists were compared using this t-test
procedure. The three texts were pooled and treated as one text, because
there were insufficient frequencies for many digraphs. A sample of
our results is shown in Table 3 for those cases in which there were
at least ten replications of the digraph in both the August and
December tests. For the first entry, with a t-statistic of .351,
when H is true we have P{|t*| > .351} = .728. Such a t-statistic is
quite likely to have occurred by chance under H, so H cannot be
rejected. The digraphs for which the t-statistic is considered
significant (for which the p-value is less than .05) are indicated by an
asterisk to the right of the entry in the p-value column. The last
column in the table shows which hands are used to type each digraph.
For example, the first digraph, ir, is conventionally typed with a
finger of the right hand followed by a finger of the left hand; thus,
the entry R-L denotes "right" and "left," respectively. We included
this information in order to determine if a hand pattern would emerge
for those digraph tests that were significant. We could not find any
such pattern. Of the 144 cases evaluated for Typist 2 (only 50 of
which are shown in Table 3), 7.6 percent were significant. That is,
H was rejected about 8 percent of the time, which means that the
subject's typing was consistent, from August to December, on 92 percent
of the digraphs. The consistencies of the other typists are shown
below:
Table 3

RESULTS OF T-TESTS OF TRANSFORMED DIGRAPH-TIME DISTRIBUTIONS
(Typist 2, 144 digraphs, 7.6 percent significant)

       Replications   Mean Digraph Times          Standard Deviations
Digraph   Aug   Dec     Aug     Dec Delta(a)     Aug     Dec   Ratio  Pooled  t-statistic  p-value    Hand(s) used to type digraph
ir         17    18   4.687   4.649    0.038   0.342   0.298   1.147   0.321        0.351    0.728    R-L
is         34    22   4.816   4.864   -0.047   0.180   0.229   0.788   0.201       -0.863    0.392    R-L
it         37    22   4.821   4.680    0.141   0.452   0.252   1.799   0.391        1.343    0.184    R-L
iv         21    21   4.725   4.745   -0.021   0.160   0.203   0.792   0.183       -0.367    0.715    R-L
ke         17    14   4.652   4.697   -0.045   0.286   0.289   0.989   0.287       -0.432    0.669    R-L
ki         17    15   5.077   5.089   -0.012   0.161   0.202   0.797   0.182       -0.188    0.852    R-R
k-         13    11   4.807   4.835   -0.028   0.148   0.181   0.822   0.164       -0.412    0.684    R-R
la         30    25   4.683   4.615    0.068   0.152   0.109   1.395   0.135        1.852    0.070    R-L
le         43    31   4.745   4.813   -0.068   0.247   0.218   1.134   0.235       -1.227    0.224    R-L
lf         11    12   4.887   5.136   -0.250   0.413   0.448   0.920   0.432       -1.385    0.181    R-L
ll         22    14   5.023   5.038   -0.015   0.297   0.133   2.240   0.248       -0.176    0.862    R-R
lu         17    14   5.126   5.115    0.011   0.211   0.226   0.933   0.218        0.142    0.888    R-R
ly         15    11   5.261   5.221    0.040   0.214   0.155   1.381   0.191        0.527    0.603    R-R
l-         32    24   4.614   4.478    0.136   0.383   0.394   0.971   0.388        1.303    0.198    R-R
ma         22    11   4.830   4.715    0.114   0.262   0.243   1.076   0.256        1.209    0.236    R-L
me         27    19   4.691   4.568    0.122   0.111   0.153   0.729   0.130        3.149    0.003 *  R-L
mi         14    12   5.125   5.117    0.008   0.171   0.162   1.055   0.167        0.125    0.901    R-R
mp         13    12   4.967   5.115   -0.148   0.196   0.293   0.668   0.247       -1.490    0.150    R-R
ms         14    13   4.883   4.851    0.031   0.197   0.226   0.872   0.211        0.385    0.703    R-L
m-         18    15   4.790   4.861   -0.072   0.255   0.256   0.994   0.255       -0.803    0.428    R-R
nd         41    31   4.564   4.569   -0.004   0.219   0.202   1.084   0.212       -0.086    0.932    R-L
ng         33    17   4.675   4.643    0.033   0.184   0.402   0.457   0.277        0.397    0.693    R-L
no         21    14   5.198   5.144    0.054   0.151   0.131   1.157   0.143        1.086    0.285    R-R
ns         24    13   4.739   4.647    0.092   0.306   0.167   1.832   0.267        1.007    0.321    R-L
nt         28    19   4.749   4.631    0.118   0.401   0.254   1.581   0.349        1.135    0.262    R-L
n-         39    28   4.914   4.878    0.035   0.170   0.174   0.977   0.171        0.834    0.407    R-R
oc         10    10   4.821   4.750    0.072   0.295   0.357   0.826   0.327        0.491    0.629    R-L
of         23    15   4.622   4.608    0.014   0.121   0.196   0.618   0.154        0.271    0.788    R-L
om         24    20   5.088   5.071    0.017   0.228   0.228   0.999   0.228        0.240    0.811    R-R
on         42    32   5.091   5.090    0.001   0.146   0.114   1.280   0.133        0.021    0.983    R-R
or         44    37   4.694   4.723   -0.030   0.199   0.296   0.673   0.248       -0.537    0.593    R-L
ox         10    10   5.067   5.065    0.002   0.147   0.120   1.227   0.134        0.041    0.968    R-L
o-         17    11   4.727   4.731   -0.003   0.258   0.218   1.180   0.243       -0.036    0.971    R-R
pe         21    13   4.746   4.710    0.036   0.425   0.208   2.047   0.359        0.282    0.780    R-L
pl         18    15   5.285   5.271    0.014   0.153   0.065   2.365   0.122        0.341    0.736    R-R
pr         19    16   4.779   4.746    0.033   0.230   0.255   0.900   0.242        0.407    0.687    R-L
ra         45    34   5.106   5.140   -0.034   0.143   0.257   0.558   0.200       -0.756    0.452    L-L
rd         17    14   5.285   5.305   -0.019   0.142   0.182   0.779   0.161       -0.330    0.744    L-L
re         41    33   5.093   5.076    0.017   0.210   0.173   1.215   0.195        0.367    0.715    L-L
ri         16    11   4.543   4.546   -0.003   0.111   0.111   1.000   0.111       -0.068    0.946    L-R
rk         11    11   4.620   4.774   -0.154   0.116   0.177   0.659   0.149       -2.413    0.026 *  L-R
rm         13    14   4.653   4.762   -0.109   0.091   0.319   0.284   0.239       -1.186    0.247    L-R
rn         11    10   4.679   4.809   -0.131   0.095   0.266   0.357   0.196       -1.529    0.143    L-R
ro         23    13   4.707   4.624    0.083   0.201   0.078   2.563   0.168        1.416    0.166    L-R
r-         58    47   4.620   4.647   -0.026   0.149   0.140   1.061   0.145       -0.931    0.354    L-R
se         34    24   5.166   5.114    0.052   0.092   0.091   1.003   0.092        2.140    0.037 *  L-L
sm         20    16   4.938   4.852    0.086   0.291   0.266   1.095   0.281        0.917    0.365    L-R
st         28    20   5.105   5.085    0.020   0.135   0.151   0.893   0.142        0.488    0.628    L-L
sw         16    15   5.403   5.330    0.073   0.165   0.092   1.807   0.135        1.508    0.142    L-L
s-         74    48   4.809   4.724    0.085   0.305   0.191   1.593   0.266        1.727    0.087    L-R

(a) Difference between August and December means.
            Percent of Cases
Typist        Significant        Consistency (percent)
  1              11.7                   88.3
  2               7.6                   92.4
  3*             50.0                   50.0
  4               4.7                   95.3
  5               5.5                   94.5
  6              20.6                   79.4

* Typist 3 completed only the August test for Text 1 (see Fig. 1),
so these results are based upon only 48 cases. This subject took the
tests unenthusiastically and slowed down substantially (but
consistently) during her second test.

The t-tests of typing consistency over time rest on several model
assumptions. The distributional assumption of normality was reasonable
in light of the fact that the data were transformed until their
distribution was approximately normal. The variances of the transformed
digraph-time distributions differed from one digraph to another, but
for a given digraph, they varied very little from August to December.
For this reason we pooled the data from the two tests insofar as
sample-variance computations were concerned. It is well known that
t-tests are quite insensitive to small deviations from normality and
small excursions of the variance ratio from unity, so we felt quite
confident of the results of our test. We therefore concluded that it
was reasonable to consider authentication procedures based on keystroke
timing, since subjects are likely to be sufficiently consistent in
their typing patterns over time for such a procedure to be effective.

The results of our authentication analysis are summarized below.

DEVELOPMENT OF AN AUTHENTICATION PROCEDURE

The statistical model developed for authenticating subjects on the
basis of their keystroke timing patterns is presented in the Appendix.
In the following, we describe the results of applying this statistical
model to the empirical data obtained in our experiment.

Since the digraph frequencies in each of the three separate texts
were often very low (too low to permit meaningful statistical inferences),
we pooled the data from all texts typed by a given subject during a
given test into a single sample. As indicated in Table 2, this usually
meant that three texts were pooled, but in two cases only two texts
were pooled, and Typist 3 typed only one text during the August test.
The pooling resulted in "reasonably large" digraph frequencies for a
number of digraphs for all samples except the August test of Typist 3.
Since that sample did not provide sufficiently large numbers of digraph
replications for statistical inferences to be made, we excluded it from
our subsequent analyses. This left 87 digraphs for which ten or more
replications were available in each of the remaining eleven cases
(December for Typist 3 and both August and December for the other five
typists).

Our first test of authentication used all 87 digraphs. The
combination of a given subject and a given time (say, Typist 2, August
test) was used to define the "originator." Then all the other tests,
in both August and December, were compared with the originator's test.
All these others were considered "claimants," including Typist 2 in
the December test. Any authentication test results other than those
in which Typist 2, December, was authenticated were considered errors
(a primary error if originator and claimant were different, but the
procedure authenticated; a secondary error if originator and claimant
were the same, but the procedure did not authenticate). Since there
were eleven cases, there were eleven possible originators and ten
possible claimants for each choice of originator. However, the roles
of originator and claimant were symmetric in our procedure; that is,
when comparing two samples it is irrelevant which of them is labeled
originator and which is labeled claimant, as the results will be the
same in either case. Thus we had 55 unique authentication tests.
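The count of 55 unique tests, and the definitions of primary and
secondary errors, can be illustrated with a short Python sketch. The
data layout (typist number paired with a session label) and the function
name are assumptions made for this example.

```python
from itertools import combinations

# Eleven cases: both sessions for every typist except Typist 3 in August,
# whose sample was excluded from the analysis.
CASES = [(typist, session)
         for typist in range(1, 7)
         for session in ("Aug", "Dec")
         if (typist, session) != (3, "Aug")]

# The procedure is symmetric, so unordered pairs of cases suffice.
PAIRS = list(combinations(CASES, 2))


def classify(originator, claimant, authenticated):
    """Label one authentication outcome using the report's terminology."""
    same_person = originator[0] == claimant[0]
    if authenticated and not same_person:
        return "primary error"      # an impostor was accepted
    if not authenticated and same_person:
        return "secondary error"    # the true typist was rejected
    return "correct"
```

Here `len(CASES)` is 11 and `len(PAIRS)` is 55; five of those pairs
compare a typist with himself or herself, and the remaining 50 compare
different typists.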

In  each  authentication  test,  a  vector  of  transformed  means  for  the 
87  digraphs  of  the  originator  was  compared  with  a  similar  vector  of 
transformed  means  of  the  same  87  digraphs  of  the  claimant.  In  each 
case,  we  studied  both  the  number  of  primary  and  secondary  errors  made 
and  the  p-value  corresponding  to  the  strength  of  the  55  separate  tests. 

The results for all tests showed no primary errors, although there
were two secondary errors (Typists 1 and 6 were both incorrectly denied
access to the computer when their August tests were compared against
their own December tests). The questions at this point were the
following:

1. Is it possible that the secondary errors could be eliminated
by eliminating certain digraphs? If so, which digraphs should
be eliminated?

2. How small a vector of digraphs can be used for authentication
without any primary or secondary errors occurring? Which
digraphs are the "key" ones?

We hypothesized a mechanism that might be generating the observed
differences in the keystroke-timing patterns, in the hope that such a
hypothesis would serve as a guide for eliminating digraphs from the
87-dimensional vector (the alternative would have been to study every
possible subset, a very difficult undertaking). We assumed that
observed differences occur because of the differences in finger dexterity
and muscular coordination between subjects. If this is correct, it is
unlikely that using bogus digraphs such as (e, -) would contribute very
much to our understanding, since they represent aggregations over the
second character in the digraph (see p. 15) which would be likely to
mask individual differences. We also reasoned that finger dexterity
would most likely be different on different hands of the same subject,
so we decided to study authentication patterns using certain finger
and hand combinations.

We first eliminated all digraphs that contain a space. When the
same 55 authentication tests were carried out on the remaining 60
digraphs, one secondary error occurred: Typist 6 was again denied access
to the computer when she should have been authenticated.

Of these 60 digraphs, 11 were made with two right-hand fingers,
and 17 were made with two left-hand fingers. The remaining 32 digraphs
required a different hand for each of the two characters.
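The grouping of digraphs by hand can be sketched in Python as follows.
The key-to-hand assignment is an assumption: it is the conventional
touch-typing split of a QWERTY keyboard, which the report does not spell
out, and the function names are hypothetical. Digraphs containing
anything other than two letters (for example a space or a dash
placeholder) are left unclassified.

```python
# Assumed conventional touch-typing assignment on a QWERTY keyboard.
LEFT_HAND = set("qwertasdfgzxcvb")
RIGHT_HAND = set("yuiophjklnm")


def hand_pattern(digraph):
    """Return 'R-R', 'L-L', 'R-L', or 'L-R'; None if not two letters."""
    if len(digraph) != 2:
        return None
    hands = []
    for ch in digraph.lower():
        if ch in LEFT_HAND:
            hands.append("L")
        elif ch in RIGHT_HAND:
            hands.append("R")
        else:
            return None
    return "-".join(hands)


def group_by_hands(digraphs):
    """Split digraphs into same-hand and mixed-hand groups."""
    groups = {"R-R": [], "L-L": [], "mixed": []}
    for d in digraphs:
        pattern = hand_pattern(d)
        if pattern is None:
            continue                 # skip digraphs that are not two letters
        key = pattern if pattern in ("R-R", "L-L") else "mixed"
        groups[key].append(d)
    return groups
```

Under this assignment, ir is classified R-L and ki is classified R-R,
which agrees with the hand entries the report gives for those digraphs.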

Authentication tests performed with only the 17 left-left (L-L)
digraphs produced one primary error and one secondary error. But when
we used only the 11 right-right (R-R) digraphs, we found no errors of
either kind, a perfect authentication record. Therefore, we decided
to concentrate on R-R digraphs for authentication.

We next addressed the question of whether or not the vector of
size 11 could be reduced. We started by studying the strength with
which authentication was carried out in each of the tests when
particular digraphs were deleted from the set of 11.* This process
ultimately suggested a subset of 5 R-R digraphs with which authentication
could be carried out for all 55 tests without primary or secondary
errors. This subset comprises a core of four "necessary" digraphs (in,
io, no, on) plus one other digraph that could be ul, il, or ly; that
is, the core plus any one of these three could be used to produce
authentication with no errors. These digraphs are all typed using only
the second, third, or fourth fingers of the right hand. Note also that
each digraph contains at least one vowel (including y).

In  addition  to  determining  that  our  authentication  procedure  will 
work  without  error  in  all  cases,  it  is  important  to  understand  the 
strength  of  the  procedure.  That  is,  when  the  procedure  authenticates 
in  a  particular  instance,  does  it  do  so  just  barely,  or  does  it  do  so 
with  very  little  question?  When  the  procedure  says  the  claimant  is  an 
impostor,  does  it  give  a  resounding  rejection  or  a  borderline  one? 

Our authentication procedure was keyed to operate at a 5 percent
level of significance (other significance levels can be selected for
a given situation, but we retained this level throughout our preliminary
study for convenience and consistency). This means that when the claimant
is not an impostor, the procedure should authenticate with a p-value ≥ .05.
On our tests using the digraphs in, io, no, on, and ul, the p-values
were as shown in Table 4.

In our comparisons, when the p-value should have been large (≥ .05),
it actually was very large, in all cases except that of Typist 6 versus
Typist 6, where the p-value was only .078. In the opposite situation,
where we wanted a small p-value (< .05), it was very small in all cases.

* We also studied the rankings for each typist for each digraph
and found cases where the ranks for a particular digraph differed
strongly across subjects. Such a digraph was considered a candidate
for retention, and others were rejected.
Table 4

RESULTS OF AUTHENTICATION PROCEDURE USING
DIGRAPHS IN, IO, NO, ON, AND UL

Case (Typist/Test)      p-value(a)      Case (Typist/Test)      p-value(b)
1/Aug vs. 1/Dec           .304          1 vs. all others          .017
2/Aug vs. 2/Dec           .321          2 vs. all others          .001
-                          -            3 vs. all others          .001
4/Aug vs. 4/Dec           .977          4 vs. all others          .000
5/Aug vs. 5/Dec           .150          5 vs. all others          .017
6/Aug vs. 6/Dec           .078          6 vs. all others          .004

(a) Should be ≥ .05.
(b) Should be < .05.
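The decision rule can be checked against the p-values of Table 4 with a
small Python sketch. The dictionaries simply transcribe the two p-value
columns; the function names, and the reading of each "versus all others"
entry as a single p-value per typist, are assumptions made for this
illustration.

```python
# p-values from Table 4, keyed by typist number.
SELF_P = {1: 0.304, 2: 0.321, 4: 0.977, 5: 0.150, 6: 0.078}   # Aug vs. Dec
OTHERS_P = {1: 0.017, 2: 0.001, 3: 0.001, 4: 0.000, 5: 0.017, 6: 0.004}


def authenticates(p_value, alpha=0.05):
    """The claimant is accepted when the p-value is at least alpha."""
    return p_value >= alpha


def count_errors(alpha):
    """Return (primary errors, secondary errors) at significance level alpha."""
    primary = sum(1 for p in OTHERS_P.values() if authenticates(p, alpha))
    secondary = sum(1 for p in SELF_P.values() if not authenticates(p, alpha))
    return primary, secondary


def error_free_bounds():
    """Alpha must exceed the first value and not exceed the second."""
    return max(OTHERS_P.values()), min(SELF_P.values())
```

At a level of .05 this gives no errors of either kind. The bounds
returned by `error_free_bounds()` are .017 and .078, so any level above
.017 and no greater than .078 would give the same error-free result on
these data.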


The weakest case was that of Typist 6 versus Typist 6, where the p-value
was .078. Thus, the procedure worked quite well at the 95 percent
confidence level.* In some situations, 90 percent confidence or less
would be adequate, while in other, critical situations, 99.999 percent
or more might be required. It is likely that situations requiring high
levels of confidence would require very sophisticated digraph
combinations (or possibly trigraphs or tetragraphs).

* The significance level in Table 4 could have been anywhere from
.017 to .078 (instead of .05), and the same results would have been
obtained, i.e., there would have been no errors of authentication.
This corresponds to a confidence-level variation of from 92.2 to 98.3
percent. This set of five digraphs was our best case in terms of the
possible range of error-free confidence-level variation.

We do not yet fully understand why the particular digraphs we
studied appear to be the key discriminators among our small sample of
subjects, and of course we do not yet know whether these digraphs would
serve us as well in a new, different, and larger sample. These
preliminary results are sufficiently promising, however, to make us very
hopeful for positive results in related research in the future.
Because the two left-handed subjects in our sample were "nonfamilial,"
that is, left-handedness does not "run" in either of their families,
we believe that the organization of the cerebral hemispheres of their
brains is similar to that of right-handed people (see Hardyck et al.,
1977). That is, the left hemispheres of their brains probably control
the typing patterns that are likely to be most subject-specific in the
right hand. It is therefore not surprising that all six subjects could
be authenticated with R-R digraph combinations only. Future samples
should include familial left-handers as well, to determine whether L-L
or L-R and R-L digraphs will also be required for authentication.
V. CONCLUSIONS


The results obtained so far in this study have been very gratifying.
However, our explorations into this important area of research
are very preliminary, and our conclusions are based upon a small and
imperfect sample of data. Therefore, we must qualify them in many ways.

Nevertheless, preliminary analysis strongly suggests that there is
indeed a typing "signature"; that is, professional typists really do
appear to have distinguishable "styles" of typing, as measured by
patterns of expected times to type certain digraphs.

The second, and certainly subsidiary, conclusion of this study is
that with the statistical authentication procedure we have developed,
the five digraphs in, io, no, on, and ul are sufficient to distinguish
right-handed touch typists from one another in a reliable way. This
result must of course be validated on new samples of much greater size,
and for less expert typists. We are cautiously optimistic that further
experimentation will corroborate our preliminary findings.
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Appendix 

DERIVATION  OF  THE  STATISTICAL  MODEL  FOR  AUTHENTICATION 


DEFINITION  OF  THE  PROBLEM 

Let $X : (r \times 1)$ denote an $r \times 1$ random vector of observable
characteristics corresponding to the keystroke-timing performance of the
originator, and let $Y : (r \times 1)$ denote an analogous vector for the
claimant. Assume $X$ follows the normal probability distribution with
mean $\theta_o$ and covariance matrix $D$,

$$\Pi_o = N(\theta_o, D),$$

and that $Y$ follows the distribution

$$\Pi_c = N(\theta_c, D),$$

where $D$ denotes the diagonal matrix

$$D = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_r^2).$$

The raw digraph times corresponding to the originator and claimant are
assumed to have been mathematically transformed until they satisfy the
above assumptions. Suppose a sample of size $M$ is available from $\Pi_o$
for the originator, and a sample of size $N$ is available from $\Pi_c$ for
the claimant. That is, we have available independent observation
vectors $(x_1, \ldots, x_M)$ and $(y_1, \ldots, y_N)$, the two sets are
assumed independent, and the $x_j$'s follow $\Pi_o$, while the $y_k$'s
follow $\Pi_c$. The authentication problem is now one of hypothesis
testing (in a classical statistical sense) in that we wish to test the
hypothesis that $\Pi_o = \Pi_c$, versus the alternative hypothesis that
$\Pi_o \neq \Pi_c$.

If the originator and the claimant are statistically the same
person, we will conclude that the keystroke-timing characteristics of
the claimant are sufficiently similar to those of the originator that,
with a high degree of confidence, the keystroke patterns were most
likely generated by the same individual.

If a test of hypotheses suggests that $\Pi_o \neq \Pi_c$, we should
conclude that the keystroke-timing patterns of the originator and
claimant are sufficiently dissimilar that the subjects are most likely
different people. (An alternative explanation for an observed difference
is, of course, that the originator and claimant are actually the same
person, but for some reason, the keystroke-timing "signature" of the
subject has undergone a structural change.)

Using conventional statistical notation, we will test the hypothesis

$$H: \theta_o = \theta_c, \quad D > 0, \quad D \text{ is diagonal}$$

against the hypothesis

$$A: \theta_o \neq \theta_c, \quad D > 0, \quad D \text{ is diagonal}$$

where the notation $D > 0$ means that the matrix $D$ is assumed to be any
positive definite (symmetric) matrix (and of course, in this instance,
it must be diagonal as well).


REDUCTION  TO  CANONICAL  FORM 

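We now put the problem into canonical form by first going to
sufficient statistics. Define the sample means and variances

$$\bar{x} = \frac{1}{M} \sum_{i=1}^{M} x_i, \qquad \bar{y} = \frac{1}{N} \sum_{j=1}^{N} y_j,$$

$$v_k^2 = \sum_{i=1}^{M} (x_{ki} - \bar{x}_k)^2 + \sum_{j=1}^{N} (y_{kj} - \bar{y}_k)^2, \qquad k = 1, \ldots, r,$$

where

$$x_i \equiv (x_{ki}), \quad y_j \equiv (y_{kj}), \quad \bar{x} \equiv (\bar{x}_k), \quad \bar{y} \equiv (\bar{y}_k);$$

and let $v$ denote the vector of sample sums of squares:

$$v \equiv (v_k^2) : (r \times 1).$$

Note that $(\bar{x}, \bar{y}, v)$ is sufficient for $(\theta_o, \theta_c, D)$.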

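The distributions of the sufficient statistics are well known.
Since

$$\mathcal{L}(\bar{x}) = N(\theta_o, D/M), \qquad \mathcal{L}(\bar{y}) = N(\theta_c, D/N),$$

it follows that

$$\mathcal{L}(\bar{x} - \bar{y}) = N(\theta_o - \theta_c, \tau D),$$

where $\tau = M^{-1} + N^{-1}$, and $\mathcal{L}(\cdot)$ denotes the
probability law of the quantity in parentheses.

Note that

$$\mathcal{L}(v_j^2 / \sigma_j^2) = \chi_\nu^2, \qquad j = 1, \ldots, r,$$

where $\nu = M + N - 2$. The problem may now be rewritten in the more
compact form:

$$\mathcal{L}(z) = N(\phi, D), \qquad \mathcal{L}(v_j^2 / \sigma_j^2) = \chi_\nu^2,$$

where

$$z = \frac{\bar{x} - \bar{y}}{\sqrt{\tau}}, \qquad \phi = \frac{\theta_o - \theta_c}{\sqrt{\tau}},$$

and the problem is to test

$$H: \phi = 0, \ D > 0, \quad \text{vs.} \quad A: \phi \neq 0, \ D > 0.$$

Clearly $(z, v)$ is sufficient for $(\phi, D)$; also, it is well known that
$z$ and $v$ are stochastically independent.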
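
The reduction to canonical form can be sketched in a few lines of code.
The following is only an illustration, not the program used in the
study: the function name canonical_statistics and the small data set
are hypothetical, and the inputs are assumed to be digraph times that
have already been transformed to satisfy the normality assumptions.

```python
import math


def canonical_statistics(xs, ys):
    """Reduce two samples of r-dimensional vectors to (z, v2, nu).

    xs: list of M originator vectors; ys: list of N claimant vectors.
    Each vector is a list of r (already transformed) digraph times.
    """
    M, N = len(xs), len(ys)
    r = len(xs[0])
    tau = 1.0 / M + 1.0 / N      # tau = M^-1 + N^-1
    nu = M + N - 2               # degrees of freedom of each v_k^2

    # Component-wise sample means x-bar_k and y-bar_k.
    xbar = [sum(x[k] for x in xs) / M for k in range(r)]
    ybar = [sum(y[k] for y in ys) / N for k in range(r)]

    # Pooled sums of squares v_k^2 about the respective sample means.
    v2 = [
        sum((x[k] - xbar[k]) ** 2 for x in xs)
        + sum((y[k] - ybar[k]) ** 2 for y in ys)
        for k in range(r)
    ]

    # Canonical variable z = (x-bar - y-bar) / sqrt(tau).
    z = [(xbar[k] - ybar[k]) / math.sqrt(tau) for k in range(r)]
    return z, v2, nu


# Made-up illustrative data: M = 3, N = 2, r = 2.
xs = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
ys = [[2.0, 3.0], [4.0, 5.0]]
z, v2, nu = canonical_statistics(xs, ys)
```

For this toy input the second component has equal sample means, so its
entry of z is zero and it contributes nothing to the evidence against H.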
LIKELIHOOD-RATIO TEST

Let $\Omega$ denote the parameters in the canonical problem, so that

$$\Omega = (\phi; D) = (\phi_1, \ldots, \phi_r; \sigma_1^2, \ldots, \sigma_r^2).$$

The joint density of the sufficient statistics is given by

$$f(z, v \mid \Omega) = f_1(z \mid \Omega) \, f_2(v \mid \Omega),$$

where

$$f_1(z \mid \Omega) \propto |D|^{-1/2} \exp\left\{-\tfrac{1}{2} (z - \phi)' D^{-1} (z - \phi)\right\},$$

$$f_2(v \mid \Omega) = \prod_{j=1}^{r} g(v_j^2 \mid \sigma_j^2),$$

$$g(v_j^2 \mid \sigma_j^2) \propto \frac{(v_j^2)^{\nu/2 - 1}}{(\sigma_j^2)^{\nu/2}} \exp\left(-\frac{v_j^2}{2\sigma_j^2}\right).$$

The notation $\propto$ means "is proportional to," the prime denotes a
transposed matrix, and $|D|$ denotes the determinant of $D$. Combining
terms shows that we can write

$$f(z, v \mid \Omega) \propto \prod_{j=1}^{r} h_j(z_j, v_j^2 \mid \phi_j, \sigma_j^2),$$

where $z \equiv (z_j)$, $v \equiv (v_j^2)$, $\phi \equiv (\phi_j)$, and

$$h_j(z_j, v_j^2 \mid \phi_j, \sigma_j^2) = \frac{(v_j^2)^{\nu/2 - 1}}{(\sigma_j^2)^{(\nu+1)/2}} \exp\left\{-\frac{v_j^2 + (z_j - \phi_j)^2}{2\sigma_j^2}\right\}.$$
The likelihood ratio statistic (LRS) for testing $H$ versus $A$ is
defined as

$$\Lambda = \frac{\max_{H} f(z, v \mid \Omega)}{\max_{H \cup A} f(z, v \mid \Omega)}.$$

It is straightforward to check that in this case, $\Lambda$ is given by

$$\Lambda = \prod_{j=1}^{r} \left[1 + \frac{z_j^2}{v_j^2}\right]^{-(\nu+1)/2}.$$

The test is to reject $H$ if $\Lambda$ is too small, i.e., reject $H$ if
$\Lambda < C^*$, where $C^*$ denotes some constant that must still be
determined.
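
The maximization behind the closed form of the LRS is not spelled out
in the text; one way to carry it out is sketched here. The auxiliary
symbol $S_j$ is introduced only for this sketch and does not appear in
the original derivation. Because the joint density factors over $j$,
each factor can be maximized separately.

```latex
\begin{align*}
&\text{Under } H \cup A:\quad \hat{\phi}_j = z_j, \qquad S_j = v_j^2; \\
&\text{Under } H:\quad \phi_j = 0, \qquad S_j = v_j^2 + z_j^2; \\[4pt]
&\text{In either case, } \frac{\partial}{\partial \sigma_j^2}
  \left[ -\frac{\nu+1}{2} \log \sigma_j^2 - \frac{S_j}{2\sigma_j^2} \right] = 0
  \;\Longrightarrow\; \hat{\sigma}_j^2 = \frac{S_j}{\nu+1}, \\[4pt]
&\max_{\sigma_j^2} \, (\sigma_j^2)^{-(\nu+1)/2}
  \exp\!\left( -\frac{S_j}{2\sigma_j^2} \right)
  = \left( \frac{S_j}{\nu+1} \right)^{-(\nu+1)/2} e^{-(\nu+1)/2}, \\[4pt]
&\Lambda = \prod_{j=1}^{r}
  \left( \frac{v_j^2 + z_j^2}{v_j^2} \right)^{-(\nu+1)/2}
  = \prod_{j=1}^{r} \left[ 1 + \frac{z_j^2}{v_j^2} \right]^{-(\nu+1)/2}.
\end{align*}
```

The factors $(v_j^2)^{\nu/2 - 1}$, $(\nu+1)^{(\nu+1)/2}$, and
$e^{-(\nu+1)/2}$ are common to numerator and denominator and cancel in
the ratio.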


DISTRIBUTION  OF  LRS  UNDER  H

Define

$$U = \Lambda^{2/(\nu+1)} = \left[\prod_{j=1}^{r} \left(1 + \frac{z_j^2}{v_j^2}\right)\right]^{-1}.$$

Then, an equivalent test is to reject $H$ if $U < C$, where $C$ is some
unknown constant that must be determined.

Now note from the above distributional statements that under $H$,

$$\mathcal{L}(z_j^2 / \sigma_j^2) = \chi_1^2, \qquad \mathcal{L}(v_j^2 / \sigma_j^2) = \chi_\nu^2,$$

where $z_j^2$ and $v_j^2$ are independent. Thus, from a distributional
standpoint, we may write

$$U = \prod_{j=1}^{r} Z_j,$$

where

$$\mathcal{L}(Z_j) = \mathcal{L}\left(\frac{1}{1 + \chi_1^2 / \chi_\nu^2}\right).$$

Next note that $\mathcal{L}(Z_j) = \beta(\nu/2, 1/2)$, where $\beta(a, b)$
denotes a beta distribution with density

$$p(x \mid a, b) = \frac{1}{B(a, b)} \, x^{a-1} (1 - x)^{b-1}, \qquad 0 < x < 1, \ a > 0, \ b > 0,$$

and is 0 otherwise; and $B(a, b)$ denotes a beta function. Thus, the test
statistic $U$ is distributed under $H$ as the product of independent beta
variates with identical degrees of freedom.
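
The distributional claim can be checked numerically. The sketch below
is an illustration only: the function names, the seed, and the choices
nu = 18, r = 5 are arbitrary assumptions, and sigma_j is set to 1
without loss of generality. It simulates U once from chi-square
variates and once as a product of beta(nu/2, 1/2) variates, and compares
both sample means with the product of the beta means, (nu/(nu+1))^r.

```python
import random


def simulate_U(nu, r, rng):
    """Draw U = prod_j v_j^2 / (v_j^2 + z_j^2) under H, with sigma_j = 1."""
    u = 1.0
    for _ in range(r):
        z2 = rng.gauss(0.0, 1.0) ** 2          # chi-square, 1 d.f.
        v2 = rng.gammavariate(nu / 2.0, 2.0)   # chi-square, nu d.f.
        u *= v2 / (v2 + z2)                    # Z_j = 1 / (1 + z_j^2 / v_j^2)
    return u


def simulate_beta_product(nu, r, rng):
    """Draw a product of r independent beta(nu/2, 1/2) variates."""
    u = 1.0
    for _ in range(r):
        u *= rng.betavariate(nu / 2.0, 0.5)
    return u


rng = random.Random(12345)       # fixed seed so the run is repeatable
nu, r, trials = 18, 5, 20000
mean_U = sum(simulate_U(nu, r, rng) for _ in range(trials)) / trials
mean_beta = sum(simulate_beta_product(nu, r, rng) for _ in range(trials)) / trials
exact_mean = (nu / (nu + 1.0)) ** r   # product of the r beta means
```

Agreement of the two simulated means with exact_mean, up to Monte Carlo
error, is consistent with the statement that U behaves under H like a
product of independent beta variates.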


APPROXIMATE  DISTRIBUTION  OF  LRS  UNDER  H 

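The exact distribution of a product of independent beta variates
is very complicated. We therefore propose below an approximation which
is adequate for our purposes. This approximation involves replacing
the product of independent beta variates by a single beta variate that
has the same first two moments (see Tukey and Wilks, 1946).

Accordingly, assume

$$\mathcal{L}(U) \cong \beta(\gamma, \delta),$$

where $(\gamma, \delta)$ are degrees-of-freedom parameters that will be
determined in terms of the known constants $(\nu, r)$. Note that since

$$U = \prod_{j=1}^{r} Z_j,$$

and because it is well known that

$$E(U) = \frac{\gamma}{\gamma + \delta},$$

we have

$$\frac{\gamma}{\gamma + \delta} = E\left(\prod_{j=1}^{r} Z_j\right) = \prod_{j=1}^{r} E(Z_j) = \prod_{1}^{r} \frac{\nu/2}{\nu/2 + 1/2} = \left(\frac{\nu}{\nu+1}\right)^{r}. \tag{1}$$

Similarly, since

$$E(U^2) = E\left(\prod_{j=1}^{r} Z_j^2\right) = \prod_{j=1}^{r} E(Z_j^2),$$

and it is well known that

$$E(Z_j^2) = \frac{\Gamma\left(\frac{\nu+4}{2}\right) \Gamma\left(\frac{\nu+1}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right) \Gamma\left(\frac{\nu+5}{2}\right)} = \frac{\nu(\nu+2)}{(\nu+1)(\nu+3)},$$

we have

$$\frac{\gamma(\gamma+1)}{(\gamma+\delta+1)(\gamma+\delta)} = \left[\frac{\nu(\nu+2)}{(\nu+1)(\nu+3)}\right]^{r}. \tag{2}$$

Equations (1) and (2) must now be solved simultaneously for
$(\gamma, \delta)$, for fixed $(\nu, r)$. Define the constants

$$w_1 = \left(\frac{\nu}{\nu+1}\right)^{r}, \qquad w_2 = \left[\frac{\nu(\nu+2)}{(\nu+1)(\nu+3)}\right]^{r}.$$

It is straightforward, though tedious, to show that

$$\gamma = \frac{w_1 (w_1 - w_2)}{w_2 - w_1^2}, \qquad \delta = \frac{(1 - w_1)(w_1 - w_2)}{w_2 - w_1^2}. \tag{3}$$

It is also straightforward to check that $\gamma > 0$, $\delta > 0$.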
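
The moment-matching solution (3) is easy to evaluate numerically. The
following is a minimal sketch; the function name beta_approximation and
the example values nu = 18, r = 5 are illustrative assumptions, not
values taken from the experiment.

```python
def beta_approximation(nu, r):
    """Return (gamma, delta) of the beta variate matching E(U) and E(U^2).

    nu = M + N - 2 and r is the number of digraphs (components).
    """
    w1 = (nu / (nu + 1.0)) ** r                                # E(U)
    w2 = (nu * (nu + 2.0) / ((nu + 1.0) * (nu + 3.0))) ** r    # E(U^2)
    denom = w2 - w1 ** 2          # variance of U under H, positive
    gamma = w1 * (w1 - w2) / denom
    delta = (1.0 - w1) * (w1 - w2) / denom
    return gamma, delta


# Illustrative call: two samples of 10 observations each, 5 digraphs.
gamma, delta = beta_approximation(nu=18, r=5)
```

For r = 1 the function returns (nu/2, 1/2) up to rounding, recovering
the exact beta distribution of a single Z_j, which is a convenient
sanity check on the algebra.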

AUTHENTICATION  TEST 

The test for authentication is the test of hypothesis $H$ versus
$A$. That is, if we cannot reject $H$, we conclude that the claimant
should be authenticated; otherwise, we conclude that the claimant is
an impostor. The test of $H$ versus $A$ developed above is to reject $H$
if $U < C$, where $C$ was not yet determined. Now we know that under
$H$, it is approximately true (for all sample sizes) that
$\mathcal{L}(U) = \beta(\gamma, \delta)$. Therefore,

$$P\{U < C \mid H\} = F(C),$$

where $F(C)$ denotes the cumulative distribution function of a beta
variate with $(\gamma, \delta)$ degrees of freedom; i.e.,

$$F(C) = \frac{1}{B(\gamma, \delta)} \int_0^C x^{\gamma-1} (1 - x)^{\delta-1} \, dx.$$

$F(C)$ is also known as an incomplete beta function. Let $\alpha = F(C)$
denote the level of significance of the test of the hypothesis. If
$\alpha$ is prespecified according to the level of risk the decisionmaker
is willing to take (the size of $\alpha$ will vary according to the
context of the problem), then, since $F(C)$ is a monotone function of its
argument, $C$ will be uniquely determined.
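
One way to obtain C for a given significance level is to evaluate the
incomplete beta function numerically and invert it by bisection. The
sketch below is an assumption about how this could be done today, not a
description of the routine used in the study. It evaluates F with the
standard continued-fraction expansion (modified Lentz method); the names
beta_cdf and critical_value, and the example parameters, are
hypothetical.

```python
import math


def _betacf(a, b, x, max_iter=300, eps=1e-12):
    """Continued fraction for the incomplete beta function (modified Lentz)."""
    tiny = 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        # Even and odd coefficients of the continued fraction, in turn.
        for aa in (m * (b - m) * x / ((qam + m2) * (a + m2)),
                   -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))):
            d = 1.0 + aa * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + aa / c
            c = c if abs(c) > tiny else tiny
            step = d * c
            h *= step
        if abs(step - 1.0) < eps:
            break
    return h


def beta_cdf(x, a, b):
    """F(x): cumulative distribution function of a beta(a, b) variate."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                     + a * math.log(x) + b * math.log1p(-x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _betacf(a, b, x) / a
    return 1.0 - front * _betacf(b, a, 1.0 - x) / b


def critical_value(alpha, gamma, delta, tol=1e-10):
    """Solve F(C) = alpha for C by bisection; F is monotone on (0, 1)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if beta_cdf(mid, gamma, delta) < alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Illustrative call: alpha = 0.05 with beta(9, 1/2), i.e. nu = 18 and r = 1.
C = critical_value(0.05, 9.0, 0.5)
```

Bisection is used here because it needs only the monotonicity of F that
the text already relies on; a library quantile function would serve
equally well where one is available.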

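The test for authentication now becomes: Do not authenticate if
$U < C$, and authenticate if $U \geq C$, where

$$U = \left[\prod_{k=1}^{r} (1 + t_k^2)\right]^{-1},$$

$$t_k = \frac{\bar{x}_k - \bar{y}_k}{v_k} \sqrt{\frac{MN}{M + N}},$$

and

$$v_k^2 = \sum_{i=1}^{M} (x_{ki} - \bar{x}_k)^2 + \sum_{j=1}^{N} (y_{kj} - \bar{y}_k)^2.$$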
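
The final decision rule can be sketched as follows. This is an
illustration under stated assumptions, not the authors' program: the
function name authenticate, the toy data, and the threshold C = 0.5 are
all hypothetical, and in practice C would be obtained from the chosen
significance level through the incomplete beta function. The sketch
also assumes every v_k^2 is strictly positive.

```python
def authenticate(xs, ys, C):
    """Return (U, accepted) for originator sample xs and claimant sample ys.

    xs and ys are lists of r-dimensional vectors of transformed digraph
    times; the claimant is accepted when U >= C.
    """
    M, N = len(xs), len(ys)
    r = len(xs[0])
    scale2 = M * N / (M + N)          # MN / (M + N), i.e. 1 / tau
    product = 1.0
    for k in range(r):
        xbar_k = sum(x[k] for x in xs) / M
        ybar_k = sum(y[k] for y in ys) / N
        # Pooled sum of squares v_k^2 for digraph k.
        v2_k = (sum((x[k] - xbar_k) ** 2 for x in xs)
                + sum((y[k] - ybar_k) ** 2 for y in ys))
        # t_k^2 = (x-bar_k - y-bar_k)^2 * MN / ((M + N) * v_k^2).
        t2_k = scale2 * (xbar_k - ybar_k) ** 2 / v2_k
        product *= 1.0 + t2_k
    U = 1.0 / product
    return U, U >= C


# Made-up data (M = 3, N = 2, r = 2) and a hypothetical threshold.
xs = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
ys = [[2.0, 3.0], [4.0, 5.0]]
U, accepted = authenticate(xs, ys, C=0.5)
```

Values of U near 1 indicate that the two sets of sample means are close
relative to the pooled variability, which is the case in which the
claimant is authenticated.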